Introduction

Background

Multiple sclerosis (MS) is a chronic neuroinflammatory autoimmune disease. MS is characterized by demyelination and subsequent formation of lesions through out the central nervous system (CNS). Grey matter (GM) and white matter (WM) lesions in the brain show pathological differences. Microglia are CNS-resident and can function as antigen-presenting cells and phagocytes. Their role in MS is complex and controversial (Luo 2017). In this paper by van der Poel et. al, titled “Transcriptional profiling of human microglia reveals grey-white matter heterogeneity and multiple sclerosis-associated changes”, a total of 31 human microglia samples were analyzed from 10 MS donors (MS), including 5 grey matter (GM) and 10 white matter (WM) samples, and from 11 non-neurological control donors (CON), including 5 grey matter and 11 white matter samples. From each sample, the total RNA was extracted (Poel 2019).

Objectives

The authors isolated human microglia according to the previously described general experimental design. They suggest that pathological changes in MS tissue may be associated with changes in microglia, leading to lesion initiation. The authors wanted to analyze the transcriptional profile of microglia in the collected samples by RNA-sequencing (Poel 2019). Ultimately, they wanted to discover MS-related changes in microglia between the GM and WM brain regions.

Summary of previous analysis

In A1, we did data cleaning, normalization, and mapped HUGO symbols. We initially had 21283 genes and after filtering out those with low counts (< 1 count/million), we were left with 15076 genes. TMM was used to normalize the counts. Our genes were already mapped to HUGO symbols. With an MDS plot, we saw that our samples clustered together by tissue, WM or GM, regardless of whether the samples were from MS or CON patients.

knitr::include_graphics("~/projects/A3/figures/a1_mdsplot.png")

Figure 1. MDS plot of all MS and CON patient samples. Samples clustered together by tissue regardless of patient group.

In A2, we did differential expression analysis and preliminary over-representation analysis (ORA). In the ORA, with comparison of the two tissues, GM and WM - we found 3042 genes that were differentially expressed in MS patients, and 2901 genes that were differentially expressed in CON patients using the conventional p-value of 0.05. This was a high number, so we reduced the p-value to 0.01.There 1621 genes for the MS patients and 1498 genes for the CON patients.

knitr::include_graphics("~/projects/A3/figures/a2_heatmap.png")

Dependencies and data preparation

Figure 2. Heat map of global gene expression across all samples.Samples were annotated based on their patient group and tissue on top of each column, its legend is found on the right. The heat map was coloured based on expression of each gene, where red is up-regulated and blue is down-regulated. Values were converted from counts to CPM and scaled. The samples were also hierarchically clustered by their expression (dendrograms for both rows and columns were excluded to keep the figure neat).

knitr::include_graphics("~/projects/A3/figures/a2_volcanoms.png")

knitr::include_graphics("~/projects/A3/figures/a2_volcanocon.png")

Figure 3. Volcano plot of differentially expressed genes comparing (A) grey matter and white matter tissues in MS patients, and (B) grey matter and white matter tissues in CON patients. Differentially expressed genes are coloured based on whether they are up- or down-regulated. Genes were selected when FDR < 0.05 AND logFC > 0 for up-regulated OR logFC < 0 for down-regulated. If they do not fulfill these requirements, they are labelled as not differentially expressed (DE).

Packages

if (! requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (! requireNamespace("limma", quietly = TRUE)) {
  BiocManager::install("limma")
}
if (! requireNamespace("knitr", quietly = TRUE)) {
  install.packages("knitr")
}
if (! requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
if (! requireNamespace("edgeR", quietly = TRUE)) {
  BiocManager::install("edgeR")
}
if (! requireNamespace("ComplexHeatmap", quietly = TRUE)) {
  BiocManager::install("ComplexHeatmap")
}
if (! requireNamespace("circlize", quietly = TRUE)) {
  BiocManager::install("circlize")
}
if (!requireNamespace("gprofiler2", quietly = TRUE)) {
    install.packages("gprofiler2")
}
if (!requireNamespace("tidyverse", quietly = TRUE)) {
    install.packages("tidyverse")
}
if (! requireNamespace("fgsea", quietly = TRUE)) {
  BiocManager::install("fgsea")
}

library("Biobase")
library("knitr")
library("edgeR")
library("limma")
library("dplyr")
library("ggplot2")
library("ComplexHeatmap")
library("circlize")
library("gprofiler2")
library("RColorBrewer")
library("kableExtra")
library("stringr")

Load data

We will be loading the results of our DE analysis from A2 as our input for this assignment.

# Comparison between GM and WM from MS
deg_con <- read.csv("~/projects/A3/data/results_con.csv", sep = ";", row.names = 1,
    header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
library(tidyverse)
deg_con <- deg_con %>%
    mutate_all(funs(parse_number(str_replace(., ",", "."))))
# Comparison between GM and WM from CON
deg_ms <- read.csv("~/projects/A3/data/results_ms.csv", sep = ";", row.names = 1,
    header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
deg_ms <- deg_ms %>%
    mutate_all(funs(parse_number(str_replace(., ",", "."))))

The loaded data was produced with the topTags function from the edgeR package - consisting of the log FC, log CPM, F-value, p-value, and adjusted p-value (FDR) for each DE gene.

Task 1: Non-thresholded GSEA

We will perform GSEA on the the DE genes between GM and WM for each patient group separately. We will then compare our GSEA results with the ORA results we previously got from A2.

GSEA is a non-threshold method - since we use a ranked list of genes without an arbitrary threshold, we can perform an assessment of all genes in the list. GSEA is more consistent in identifying enriched gene sets and pathways, making it a more preferable method over ORA for analysis of two sets of DE genes under two different conditions (Abatangelo and Ancona 2009). In this report, the two different conditions will be the patient groups (MS/CON).

We will make a ranked list of genes to provide as an input into GSEA. A gene’s rank is calculated by the negative log10 of its p-value by the sign of the logFC.

GSEA is performed using the GSEA software Version 4.3.2, using the predefined GSEAPreranked mode. We performed GSEA analysis with the following parameters:

  • Number of permutations: 1000
  • Max size: 200
  • Min size: 15

Max size and Min size represent the upper and lower bounds of the gene set size that is included in the analysis. This allows us to identify pathways that are specific enough for later interpretation. Due to the nature of our chosen data set, we are interested in understanding the biological processes that may be involved in this process. We used the most recent version (March 2023) of the Human_GOBP_AllPathways_no_GO_iea_March_02_2023_symbol.gmt gene set from the Bader Lab at the University of Toronto. This can be found here. We specifically chose the gene set that has no annotations from electronic annotations (IEA). This annotation file curates all GO: Biological Process terms as well as pathways from multiple pathway databases like Reactome (Vastrik et al. 2007) and MSigDB (Liberzon et al. 2011).

GSEA of MS patient samples

Ranked Gene List

# Just going to name the gene column
deg_ms$Gene <- rownames(deg_ms)
# Constructing ranked gene list
deg_ms$rank <- (-log(deg_ms$PValue, base = 10)) * sign(deg_ms$logFC)
deg_ms_ranked <- dplyr::select(deg_ms, Gene, rank) %>%
    arrange(desc(rank)) %>%
    dplyr::select(Gene, rank)
# Save the ranked gene list
write.table(deg_ms_ranked, file = "~/projects/A3/data/deg_ms.rnk", sep = "\t", col.name = TRUE,
    row.names = FALSE, quote = FALSE)

GSEA

Configuration

# Defining paths
gsea_exe <- "~/projects/GSEA_4.3.2/gsea-cli.sh"
gsea_dir <- "~/projects/A3/data/MS"
gsea_gmt <- "~/projects/A3/data/Human_GOBP_AllPathways_no_GO_iea_March_02_2023_symbol.gmt"
gsea_rnk <- "~/projects/A3/data/deg_ms.rnk"
# Permutations
perm <- 1000
# Max size
max_size <- 200
# Min size
min_size <- 15
# Analysis name
analysis_name <- "GMvsWM_MS"

Download the GMT file if not already downloaded.

if (!file.exists(gsea_gmt)) {
    gmt_url <- "http://download.baderlab.org/EM_Genesets/March_02_2023/Human/symbol/Human_GOBP_AllPathways_no_GO_iea_March_02_2023_symbol.gmt"
    download.file(gmt_url, destfile = gsea_gmt)
}

Run GSEA.

cmd <- paste(gsea_exe, "GSEAPreranked", 
            "-gmx", gsea_gmt, 
            "-collapse false", 
            "-nperm", perm, 
            "-rnk", gsea_rnk,
            "-scoring_scheme weighted",
            "-rpt_label", analysis_name,
            "-plot_top_x 20 -rnd_seed 12345",
            "-set_max", max_size, 
            "-set_min", min_size, 
            "-zip_report false", 
            "-out", gsea_dir)
path1 <- "~/projects/A3/data/MS/GMvsWM_MS.GseaPreranked.1680973644388"
if (!file.exists(path1)) {
    system(cmd)
}

Results

ms_up_reg <- read.table(file = "~/projects/A3/data/MS/GMvsWM_MS.GseaPreranked.1680973644388/gsea_report_for_na_pos_1680973644388.tsv",
    sep = "\t", header = TRUE, fill = TRUE)
ms_neg_reg <- read.table(file = "~/projects/A3/data/MS/GMvsWM_MS.GseaPreranked.1680973644388/gsea_report_for_na_neg_1680973644388.tsv",
    sep = "\t", header = TRUE, fill = TRUE)

Upregulated

# Prepare for display
MSupreg_table <- ms_up_reg %>%
    dplyr::select(NAME, SIZE, ES, FDR.q.val) %>%
    dplyr::rename(`Gene Set` = NAME, Size = SIZE, `Enrichment Score` = ES, `FDR q-value` = FDR.q.val) %>%
    # Format NAME
dplyr::mutate(`Gene Set` = stringr::str_remove(`Gene Set`, "%.*$")) %>%
    dplyr::mutate(`Gene Set` = stringr::str_to_sentence(`Gene Set`)) %>%
    # Sort the table by FDR q-value
dplyr::arrange(`FDR q-value`)
# Display the table
DT::datatable(MSupreg_table, rownames = FALSE, filter = "top", options = list(ppageLength = 10,
    scrollX = TRUE)) %>%
    # Changing the font size of content
DT::formatStyle(names(MSupreg_table), fontSize = "12px")


Table 1. Enriched gene sets of the upregulated genes in GM vs WM in the MS patient group.

Downregulated

# Prepare for display
MSnegreg_table <- ms_neg_reg %>%
    dplyr::select(NAME, SIZE, ES, FDR.q.val) %>%
    dplyr::rename(`Gene Set` = NAME, Size = SIZE, `Enrichment Score` = ES, `FDR q-value` = FDR.q.val) %>%
    # Format NAME
dplyr::mutate(`Gene Set` = stringr::str_remove(`Gene Set`, "%.*$")) %>%
    dplyr::mutate(`Gene Set` = stringr::str_to_sentence(`Gene Set`)) %>%
    # Sort the table by FDR q-value
dplyr::arrange(`FDR q-value`)
# Display the table
DT::datatable(MSnegreg_table, rownames = FALSE, filter = "top", options = list(ppageLength = 10,
    scrollX = TRUE)) %>%
    # Changing the font size of content
DT::formatStyle(names(MSnegreg_table), fontSize = "12px")


Table 2. Enriched gene sets of the downregulated genes in GM vs WM in the MS patient group.

Notably, our GSEA analysis identified 904 gene sets that were associated with the upregulated genes in GM vs WM in MS patients but r sum(MSupreg_table$`FDR q-value` < 0.25) were significantly enriched (FDR < 0.25). While there were 4089 gene sets enriched, and r sum(MSnegreg_table$`FDR q-value` < 0.25) of these gene sets were significant (FDR < 0.25) in the downregulated genes in GM vs WM in MS patients.

We can look at the top 5 gene sets over-represented at the top of the ranked gene list.

MSup_res <- data.frame(pathway = MSupreg_table$`Gene Set`, size = MSupreg_table$Size,
    FDR = MSupreg_table$`FDR q-value`, es = MSupreg_table$`Enrichment Score`)

ggplot(MSup_res[1:5, ]) + geom_point(aes(x = es, color = FDR, y = pathway, size = size)) +
    theme(axis.title.x = element_text(), axis.title.y = element_text()) + scale_color_gradient(low = "red",
    high = "blue") + labs(x = "Enrichment Score", color = "FDR", size = "Genesize set",
    y = NULL, title = "Top enriched gene sets in upregulated 
       genes in GM vs WM of MS patients")

Figure 1. Top 5 enriched gene sets identified by GSEA for the up-regulated genes. Dots are are coloured by FDR and their size is determined by the size of the gene set.

MSneg_res <- data.frame(pathway = MSnegreg_table$`Gene Set`, size = MSnegreg_table$Size,
    FDR = MSnegreg_table$`FDR q-value`, es = MSnegreg_table$`Enrichment Score`)

ggplot(MSneg_res[1:5, ]) + geom_point(aes(x = es, color = FDR, y = pathway, size = size)) +
    theme(axis.title.x = element_text(), axis.title.y = element_text()) + scale_color_gradient(low = "red",
    high = "blue") + labs(x = "Enrichment Score", color = "FDR", size = "Genesize set",
    y = NULL, title = "Top enriched gene sets in downregulated 
       genes in GM vs WM of MS patients")

Figure 2. Top 5 enriched gene sets identified by GSEA for the up-regulated genes. Dots are are coloured by FDR and their size is determined by the size of the gene set.

GSEA of CON patient samples

Ranked Gene List

# Just going to name the gene column
deg_con$Gene <- rownames(deg_con)
# Constructing ranked gene list
deg_con$rank <- (-log(deg_con$PValue, base = 10)) * sign(deg_con$logFC)
deg_con_ranked <- dplyr::select(deg_con, Gene, rank) %>%
    arrange(desc(rank)) %>%
    dplyr::select(Gene, rank)
# Save the ranked gene list
write.table(deg_con_ranked, file = "~/projects/A3/data/deg_con.rnk", sep = "\t",
    col.name = TRUE, row.names = FALSE, quote = FALSE)

GSEA

Configuration

# Defining paths
gsea_exe <- "~/projects/GSEA_4.3.2/gsea-cli.sh"
gsea_dir <- "~/projects/A3/data/CON"
gsea_gmt <- "~/projects/A3/data/Human_GOBP_AllPathways_no_GO_iea_March_02_2023_symbol.gmt"
gsea_rnk <- "~/projects/A3/data/deg_con.rnk"
# Permutations
perm <- 1000
# Max size
max_size <- 200
# Min size
min_size <- 15
# Analysis name
analysis_name <- "GMvsWM_CON"

Download the GMT file if not already downloaded.

if (!file.exists(gsea_gmt)) {
    gmt_url <- "http://download.baderlab.org/EM_Genesets/March_02_2023/Human/symbol/Human_GOBP_AllPathways_no_GO_iea_March_02_2023_symbol.gmt"
    download.file(gmt_url, destfile = gsea_gmt)
}

Run GSEA.

cmd <- paste(gsea_exe, "GSEAPreranked", 
            "-gmx", gsea_gmt, 
            "-collapse false", 
            "-nperm", perm, 
            "-rnk", gsea_rnk,
            "-scoring_scheme weighted",
            "-rpt_label", analysis_name,
            "-plot_top_x 20 -rnd_seed 12345",
            "-set_max", max_size, 
            "-set_min", min_size, 
            "-zip_report false", 
            "-out", gsea_dir)
path2 <- "~/projects/A3/data/CON/GMvsWM_CON.GseaPreranked.1680999322997"
if (!file.exists(path2)) {
    system(cmd)
}

Results

con_up_reg <- read.table(file = "~/projects/A3/data/CON/GMvsWM_CON.GseaPreranked.1680999322997/gsea_report_for_na_neg_1680999322997.tsv",
    sep = "\t", header = TRUE, fill = TRUE)
con_neg_reg <- read.table(file = "~/projects/A3/data/CON/GMvsWM_CON.GseaPreranked.1680999322997/gsea_report_for_na_pos_1680999322997.tsv",
    sep = "\t", header = TRUE, fill = TRUE)

Upregulated

# Prepare for display
CONupreg_table <- con_up_reg %>%
    dplyr::select(NAME, SIZE, ES, FDR.q.val) %>%
    dplyr::rename(`Gene Set` = NAME, Size = SIZE, `Enrichment Score` = ES, `FDR q-value` = FDR.q.val) %>%
    # Format NAME
dplyr::mutate(`Gene Set` = stringr::str_remove(`Gene Set`, "%.*$")) %>%
    dplyr::mutate(`Gene Set` = stringr::str_to_sentence(`Gene Set`)) %>%
    # Sort the table by FDR q-value
dplyr::arrange(`FDR q-value`)
# Display the table
DT::datatable(CONupreg_table, rownames = FALSE, filter = "top", options = list(ppageLength = 10,
    scrollX = TRUE)) %>%
    # Changing the font size of content
DT::formatStyle(names(CONupreg_table), fontSize = "12px")


Table 3. Enriched gene sets of the upregulated genes in GM vs WM in the CON patient group.

Downregulated

# Prepare for display
CONnegreg_table <- con_neg_reg %>%
    dplyr::select(NAME, SIZE, ES, FDR.q.val) %>%
    dplyr::rename(`Gene Set` = NAME, Size = SIZE, `Enrichment Score` = ES, `FDR q-value` = FDR.q.val) %>%
    # Format NAME
dplyr::mutate(`Gene Set` = stringr::str_remove(`Gene Set`, "%.*$")) %>%
    dplyr::mutate(`Gene Set` = stringr::str_to_sentence(`Gene Set`)) %>%
    # Sort the table by FDR q-value
dplyr::arrange(`FDR q-value`)
# Display the table
DT::datatable(CONnegreg_table, rownames = FALSE, filter = "top", options = list(ppageLength = 10,
    scrollX = TRUE)) %>%
    # Changing the font size of content
DT::formatStyle(names(CONnegreg_table), fontSize = "12px")


Table 4. Enriched gene sets of the downregulated genes in GM vs WM in the MS patient group.

Our GSEA analysis identified 3997 gene sets that were associated with the upregulated genes in GM vs WM in MS patients but r sum(CONupreg_table$`FDR q-value` < 0.25) were significantly enriched (FDR < 0.25). While there were 996 gene sets enriched, and r sum(CONnegreg_table$`FDR q-value` < 0.25) of these gene sets were significant (FDR < 0.25) in the downregulated genes in GM vs WM in MS patients.

We can look at the top 5 gene sets over-represented at the top of the ranked gene list.

CONup_res <- data.frame(pathway = CONupreg_table$`Gene Set`, size = CONupreg_table$Size,
    FDR = CONupreg_table$`FDR q-value`, es = CONupreg_table$`Enrichment Score`)

ggplot(CONup_res[1:5, ]) + geom_point(aes(x = es, color = FDR, y = pathway, size = size)) +
    theme(axis.title.x = element_text(), axis.title.y = element_text()) + scale_color_gradient(low = "red",
    high = "blue") + labs(x = "Enrichment Score", color = "FDR", size = "Genesize set",
    y = NULL, title = "Top enriched gene sets in upregulated 
       genes in GM vs WM of CON patients")

Figure 6. Top 5 enriched gene sets identified by GSEA for the up-regulated genes. Dots are are coloured by FDR and their size is determined by the size of the gene set.

CONneg_res <- data.frame(pathway = CONnegreg_table$`Gene Set`, size = CONnegreg_table$Size,
    FDR = CONnegreg_table$`FDR q-value`, es = CONnegreg_table$`Enrichment Score`)

ggplot(CONneg_res[1:5, ]) + geom_point(aes(x = es, color = FDR, y = pathway, size = size)) +
    theme(axis.title.x = element_text(), axis.title.y = element_text()) + scale_color_gradient(low = "red",
    high = "blue") + labs(x = "Enrichment Score", color = "FDR", size = "Genesize set",
    y = NULL, title = "Top enriched gene sets in downregulated 
       genes in GM vs WM of CON patients")

Figure 7. Top 5 enriched gene sets identified by GSEA for the up-regulated genes. Dots are are coloured by FDR and their size is determined by the size of the gene set.

1. What method did you use? What genesets did you use? Make sure to specify versions and cite your methods.

As mentioned earlier in this assignment, GSEA was performed using the GSEA software Version 4.3.2 for command line. We used the most recent version (March 2023) of the Human_GOBP_AllPathways_no_GO_iea_March_02_2023_symbol.gmt gene set from the Bader Lab at the University of Toronto. This can be found here.

2. Summarize your enrichment results.

We identified 904 gene sets that were associated with the upregulated genes in GM vs WM in MS patients but r sum(MSupreg_table$`FDR q-value` < 0.25) were significantly enriched (FDR < 0.25). For the downregulated genes, there were 4089 gene sets enriched, and r sum(MSnegreg_table$`FDR q-value` < 0.25) of these gene sets were significant (FDR < 0.25). In Fig. 4, we can see various gene sets that are involved in immunity such as Regulation of type I Interferon-mediated signaling pathway. In Fig. 5, a gene set that is particularly relevant to this data set is Hallmark tnfa signaling via nfkb. In A2, we found similar results with ORA and found that the NF-kB pathway is essential in the pathology of MS (Mc Guire and Loo 2013). It is interesting how this is downregulated in GM vs WM.

We identified 3997 gene sets that were associated with the upregulated genes in GM vs WM in MS patients but r sum(CONupreg_table$`FDR q-value` < 0.25) were significantly enriched (FDR < 0.25). For the downregulated genes, there were 996 gene sets enriched, and r sum(CONnegreg_table$`FDR q-value` < 0.25) of these gene sets were significant (FDR < 0.25). Interestingly, our results for this analysis (Fig. 6) is similar to what is seen in Fig. 5, with Hallmark tnfa signaling via nfkb in particular. For the MS group, the ES is -0.7880487 while in the CON group, the ES is -0.80684453. In Fig. 7, a gene set that is particularly relevant to this data set is Particular importance of Trna metabolic process..

3. How do these results compare to the results from the thresholded analysis in Assignment #2. Compare qualitatively. Is this a straight forward comparison? Why or why not?

In A2, we found:

  • For the MS group:
MSupFilePath <- "~/projects/A3/figures/a2_msUpTopTerms.RDS"
a2_MSupResults <- readRDS(MSupFilePath)
head(a2_MSupResults)

Table 3. Analysis from A2: Enriched gene sets from thresholded GSEA (p-value < 0.01) using upregulated genes only.

When comparing this to what we found in our non-thresholded GSEA in Fig.4, we can see that A2’s results outlined in Table 3 has more immune mechanism-related terms concerning Toll-like receptors (TLRs). Fig.4 seems to have more vague and generic terms.

MSdownFilePath <- "~/projects/A3/figures/a2_msDownTopTerms.rds"
a2_MSdownResults <- readRDS(MSdownFilePath)
head(a2_MSdownResults)

Table 4. Analysis from A2: Enriched gene sets from thresholded GSEA (p-value < 0.01) using downregulated genes only.

When comparing this to what we found in our non-thresholded GSEA in Fig.5, we can see that A2’s results outlined in Table 4 has terms related to translation and protein processes. There are a few translation-related terms in Fig. 5 but there is a particular gene set that stands out with the largest size out of the top 5: the previously mentioned Hallmark tnfa signaling via nfkb.

  • For the CON group:
CONupFilePath <- "~/projects/A3/figures/a2_conUpTopTerms.RDS"
a2_CONupResults <- readRDS(CONupFilePath)
head(a2_CONupResults)

Table 5. Analysis from A2: Enriched gene sets from thresholded GSEA (p-value < 0.01) using upregulated genes only.

When comparing this to what we found in our non-thresholded GSEA in Fig. 6, we can see that A2’s results outlined in Table 5 has more immune mechanism-related terms such as regulation of response and myeloid leukocyte mediated immunity. Similar to what we found in A2, Fig. 6 has the term Hallmark tnfa signaling via nfkb - this is similar to A2’s positive regulation of I-kappaB kinase/NF-kappaB signaling.

CONdownFilePath <- "~/projects/A3/figures/a2_conDownTopTerms.rds"
a2_CONdownResults <- readRDS(CONdownFilePath)
head(a2_CONdownResults)

Table 6. Analysis from A2: Enriched gene sets from thresholded GSEA (p-value < 0.01) using downregulated genes only.

When comparing this to what we found in our non-thresholded GSEA in Fig. 7, we can see that A2’s results outlined in Table 6 has more generic terms related to apoptosis. Fig. 7 has more specific terms such as Rho GTPases activate NADPH oxidases and Trna metabolic process.

While the results from the two different analyses are not the exact same, I do not think they are in disagreement as there are still a few similarities. These results also still align with experimental conditions. It is, however, important to note that this is NOT a straight forward comparison. Throughout this part of the assignment, we have only quantitatively compared and contrasted the top 5 terms of enriched gene sets. There might be other information that can affect our interpretation. Additionally, for the A2 results, we reduced our threshold to p-value < 0.01 as there were many terms that passed the conventional p-value < 0.05. This is unlike our non-thresholded GSEA in this assignment where we used the threshold FDR < 0.05. We also only used the annotation data GO:BP (releases/2022-12-04), Reactome (releases/2022-12-28), and WikiPathways (releases/2022-12-10) for A2. These collections are only a few of what is used in the gene set from the Bader Lab.

Task 2: Visualize the GSEA in Cytoscape

Enrichment map construction

We have attempted to run EnrichmentMap from within R but ran into issues. Instead, we used the EnrichmentMap (3.3.5) app (Merico et al. 2010) in the Cytoscape 3.9.1 software (Shannon et al. 2003). The EnrichmentMap app needs to be installed separately. The default parameters were used and edge cutoff was set to 0.375 with the Jaccard + Overlap combined metric.

knitr::include_graphics("~/projects/A3/figures/MS_network.png")

knitr::include_graphics("~/projects/A3/figures/CON_network.png")

Figure 8. (A) Enrichment map of the dysregulated gene sets and pathways in MS patients, (B) Enrichment map of the dysregulated gene sets and pathways in CON patients. Both are prior to manual layout. Each node is a gene set/pathway and is coloured either red or blue to indicated whether it is up- or down-regulated respectively. Node sizes correspond to gene set size. Edges represent overlap between nodes and the edge width represents the number of genes that overlap.

Network annotation

The networks made in Fig. 8 alone are not enough to interpret and infer the biological themes of these gene sets. We then applied the AutoAnnotate app (1.4.0) (BaderLab 2016) in Cytoscape (Shannon et al. 2003) to annotate the Enrichment Map. We manually eliminated clusters and nodes to make it more readable. The MCL (Markov Cluster) clustering algorithm was used to annotate the common labels of nodes. It is efficient and effective at detecting clusters in large networks. We used the default parameters: max word per label leaving = 3, minimum word occurrence = 1.

knitr::include_graphics("~/projects/A3/figures/annoMS_network.png")

knitr::include_graphics("~/projects/A3/figures/annoCON_network.png")

Figure 9. (A) Annotated enrichment map of the dysregulated gene sets and pathways in MS patients, (B) Annotated enrichment map of the dysregulated gene sets and pathways in CON patients. Each node is a gene set/pathway and is coloured either red or blue to indicated whether it is up- or down-regulated respectively. Node sizes correspond to gene set size. Edges represent overlap between nodes and the edge width represents the number of genes that overlap. The size of annotation text corresponds to the number of gene sets in the annotation.

To produce a publication-ready figure, we excluded some themes that were irrelevant - too generic, little/no edges between nodes.

knitr::include_graphics("~/projects/A3/figures/pubReadyAnno_network.png")

Figure 10. Enrichment Map created with the EnrichmentMap (3.3.5) app in Cytoscape 3.9.1. Network was created with parameters FDR Q-value threshold < 0.01 and combined similarity cutoff > 0.375. Annotations were made with the AutoAnnotate app (1.4.0). Themes with little/no edges and generic terms were removed. (A) Annotated enrichment map of the dysregulated gene sets and pathways in MS patients, (B) Annotated enrichment map of the dysregulated gene sets and pathways in CON patients. Each node is a gene set/pathway and is coloured either red or blue to indicated whether it is up- or down-regulated respectively. Node sizes correspond to gene set size. Edges represent overlap between nodes and the edge width represents the number of genes that overlap. The size of annotation text corresponds to the number of gene sets in the annotation.

Collapse to a theme network

Fig. 9 gives us a better idea of the biological themes of our gene sets. To better visualize these relationships, we further collapsed the clusters into single nodes by selecting the Collapse all option in the AutoAnnotate menu and used the CoSE (Compound Spring Embedder) algorithm.

knitr::include_graphics("~/projects/A3/figures/collapseAnnoMS_network.png")

knitr::include_graphics("~/projects/A3/figures/collapseAnnoCON_network.png")

Figure 11. (A) Collapsed annotated enrichment map of the dysregulated gene sets and pathways in MS patients, (B) Collapsed annotated enrichment map of the dysregulated gene sets and pathways in CON patients. CoSE cluster layout was used. Each node is a gene set/pathway and is coloured either red or blue to indicated whether it is up- or down-regulated respectively. Node sizes correspond to gene set size. Edges represent overlap between nodes and the edge width represents the number of genes that overlap. The size of annotation text corresponds to the number of gene sets in the annotation.

Fig. 11 still seems to be a bit “busy”. We excluded some of the smaller individual nodes for both networks.

knitr::include_graphics("~/projects/A3/figures/legendExcludedNetworks.png")

Figure 12. (A) Collapsed annotated enrichment map of the dysregulated gene sets and pathways in MS patients, (B) Collapsed annotated enrichment map of the dysregulated gene sets and pathways in CON patients. CoSE cluster layout was used. Each node is a gene set/pathway and is coloured either red or blue to indicated whether it is up- or down-regulated respectively. Node sizes correspond to gene set size. Edges represent overlap between nodes and the edge width represents the number of genes that overlap. The size of annotation text corresponds to the number of gene sets in the annotation. Smaller nodes from Fig. 11 were excluded to produce these figures.

Task 3: Interpretation and detailed view of results

Abatangelo, Maglietta, L., and N. Ancona. 2009. “Comparative Study of Gene Set Enrichment Methods.” BMC Bioinformatics 10 (275).
BaderLab. 2016. “Autoannotate.” https://apps.cytoscape.org/apps/autoannotate.
Gu, Z. 2016. “ComplexHeatmap.”
———. 2022. “Circlize: Circular Visualization.”
Kolberg, Raudvere, L. 2021. “Gprofiler2: Interface to the ’g:profiler’ Toolset.”
Liberzon, Arthur, Aravind Subramanian, Reid Pinchback, Helga Thorvaldsdóttir, Pablo Tamayo, and Jill P Mesirov. 2011. “Molecular Signatures Database (MSigDB) 3.0.” Bioinformatics 27 (12): 1739–40.
Luo, Jian, C. 2017. “The Role of Microglia in Multiple Sclerosis.” Neuropsychiatric Disease and Treatment 13 (1661–1667).
Mc Guire, Prinz, C., and G. van Loo. 2013. “Nuclear Factor Kappa b (NF-κB) in Multiple Sclerosis Pathology.” Trends in Molecular Medicine 19 (10).
Merico, D., R. Isserlin, O. Stueker, A. Emili, and G. D. Bader. 2010. “Enrichment Map: A Network-Based Method for Gene-Set Enrichment Visualization and Interpretation.” Genome Biology 11: R94.
Morgan, Martin. 2022. “BiocManager: Access the Bioconductor Project Package Repository.”
Neuwirth, E. 2022. “RColorBrewer: ColorBrewer Palettes.”
Poel, Ulas van der, M. 2019. “Transcriptional Profiling of Human Microglia Reveals Grey–White Matter Heterogeneity and Multiple Sclerosis-Associated Changes.” Nat Commun 10 (1139).
Robinson, McCarthy, M. D. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.”
Shannon, P., A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang, D. Ramage, N. Amin, B. Schwikowski, and T. Ideker. 2003. “Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks.” Genome Research 13: 2498–2504.
Smyth, G. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.”
Vastrik, Imre, Peter D’Eustachio, Esther Schmidt, Geeta Joshi-Tope, Gopal Gopinath, David Croft, Bernard de Bono, et al. 2007. “Reactome: A Knowledge Base of Biologic Pathways and Processes.” Genome Biology 8: 1–13.
Wickham, François, H. 2023. “Dplyr: A Grammar of Data Manipulation.”
Xie, Sarma, Y. 2023. “Knitr: A General-Purpose Package for Dynamic Report Generation in r.”
Zhu, Travison, H. 2021. “kableExtra.”